Faceting
It might seem like we’re limited in the total number of variables we can display at a time. While there are many types of encodings we could in theory use, only a few of them are very effective, and they can interfere with one another.
Not all is lost, though! A very useful idea for visualizing high-dimensional data is that of small multiples. It turns out that our eyes are pretty good at making sense of many small plots, as long as there is some shared structure across the plots.
In ggplot2, we can implement this idea using the
facet_wrap and facet_grid commands. We specify
the column in the data frame along which we want to generate comparable
small multiples.
facet_wrap
facet_wrap is most effective for dividing
panels up by a single categorical variable. If you allow
facet_wrap to rely on its defaults, it will try to make a
reasonably “rectangular” decision about how to distribute the
panels.
Unfortunately, the facet commands have a slightly
different syntax than other ggplot commands. Just as
aes() tells ggplot commands to treat the
expression inside as a column of the data frame,
the facet commands use vars() for the same purpose. (There is an older
syntax, based on ~, that the facet commands still support,
but as the help page describes, this is now
considered the “classic interface”.)
years <- c(1962, 1980, 1990, 2000, 2012)
ggplot(
  gapminder %>% filter(year %in% years),
  aes(x = fertility, y = life_expectancy, col = continent)
) +
  geom_point() +
  facet_wrap(vars(year)) # we could have also written facet_wrap(~ year)

The fact that there are 5 panels is determined by the data; there are only 5 distinct years left after filtering.
However, the 3-2 arrangement you see above was decided by
facet_wrap. The nrow and ncol
arguments allow you to specify the number of rows or columns manually.
For example, to specify all panels in one column or all in one row:
# All panels in a single column
ggplot(
  gapminder %>% filter(year %in% years),
  aes(x = fertility, y = life_expectancy, col = continent)
) +
  geom_point() +
  facet_wrap(vars(year), ncol = 1)

# All panels in a single row
ggplot(
  gapminder %>% filter(year %in% years),
  aes(x = fertility, y = life_expectancy, col = continent)
) +
  geom_point() +
  facet_wrap(vars(year), nrow = 1)

Which configuration is most effective depends heavily on the context and shape of your data, so it is worth choosing the arrangement of panels deliberately rather than accepting the default.
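To see what arrangement a given nrow or ncol choice produces without rendering the plot, one option is to inspect the layout table that ggplot2 builds internally. The following is a minimal sketch: the toy data frame and the helper panel_dims are made up for illustration, and it assumes the built plot exposes a layout table with ROW and COL columns, as recent ggplot2 versions do.

```r
library(ggplot2)

# Hypothetical toy data: five groups standing in for the five years above.
toy <- data.frame(
  x = rep(1:4, times = 5),
  y = rep(c(2, 4, 3, 5), times = 5),
  group = rep(c("a", "b", "c", "d", "e"), each = 4)
)

base_plot <- ggplot(toy, aes(x, y)) + geom_point()

# Illustrative helper: number of rows and columns in a plot's panel layout.
panel_dims <- function(p) {
  layout <- ggplot_build(p)$layout$layout
  c(rows = max(layout$ROW), cols = max(layout$COL))
}

dims_one_col <- panel_dims(base_plot + facet_wrap(vars(group), ncol = 1))
dims_one_row <- panel_dims(base_plot + facet_wrap(vars(group), nrow = 1))
dims_default <- panel_dims(base_plot + facet_wrap(vars(group)))

dims_one_col  # 5 rows, 1 column
dims_one_row  # 1 row, 5 columns
dims_default  # whatever facet_wrap chose on its own
```

Comparing the three results makes it easy to judge whether the default is acceptable before committing to a figure size.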
facet_grid
facet_grid is most effective for dividing
panels up by combinations of two categorical variables. Ideally, each
panel will have some data in it, and the two variables will not be
redundant with each other.
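One way to check both conditions before faceting is to cross-tabulate the two candidate variables and see how many combinations are empty. The sketch below uses a small made-up data frame; the toy values and the helper empty_share are illustrative assumptions, not part of the gapminder data.

```r
# Hypothetical toy data: `year` is crossed with `continent`,
# but nested inside `century` (each year belongs to one century).
toy <- data.frame(
  year = rep(c(1980, 1990, 2000, 2012), each = 2),
  continent = rep(c("Africa", "Asia"), times = 4)
)
toy$century <- ifelse(toy$year < 2000, "1900s", "2000s")

crossed <- table(toy$year, toy$continent)
nested <- table(toy$year, toy$century)

# Illustrative helper: share of row-by-column combinations with no data,
# i.e. the share of panels that facet_grid would leave empty.
empty_share <- function(tab) mean(tab == 0)

empty_share(crossed)  # 0: every panel would contain data
empty_share(nested)   # 0.5: half of the panels would be empty
```

A large share of empty cells suggests the two variables are redundant, and that facet_wrap on a single variable may be the better choice.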
The first two arguments to facet_grid are
vars(row_variable) and
vars(column_variable).
ggplot(
  gapminder %>% filter(year %in% years),
  aes(fertility, life_expectancy, col = continent)
) +
  geom_point() +
  facet_grid(vars(year), vars(continent)) + # We could have also written facet_grid(year ~ continent)
  theme(legend.position = "bottom")

It is not useful, for example, to facet by
century and year, because every year is only
in one century.
ggplot(
  gapminder %>% filter(year %in% years),
  aes(fertility, life_expectancy, col = continent)
) +
  geom_point() +
  facet_grid(year ~ ifelse(year < 2000, "1900s", "2000s")) +
  theme(legend.position = "bottom")

Extended Example
This example also shows that faceting will apply to multiple
geom layers at once.
The dataset shows the abundances of five different bacteria across three different subjects over time, as they were subjected to antibiotics. The data were the basis for this study.
antibiotic <- read_csv("https://uwmadison.box.com/shared/static/5jmd9pku62291ek20lioevsw1c588ahx.csv")
head(antibiotic)

## # A tibble: 6 × 7
##   species  sample value ind    time svalue antibiotic
##   <chr>    <chr>  <dbl> <chr> <dbl>  <dbl> <chr>
## 1 Unc05qi6 D1         0 D         1   NA   Antibiotic-free
## 2 Unc05qi6 D2         0 D         2   NA   Antibiotic-free
## 3 Unc05qi6 D3         0 D         3    0   Antibiotic-free
## 4 Unc05qi6 D4         0 D         4    0   Antibiotic-free
## 5 Unc05qi6 D5         0 D         5    0   Antibiotic-free
## 6 Unc05qi6 D6         0 D         6    0.2 Antibiotic-free

I have also separately computed running averages of the
abundances; these are in the svalue column. We’ll discuss
ways to do this during the week on time series visualization.
ggplot(antibiotic, aes(x = time)) +
  geom_line(aes(y = svalue), linewidth = 1.2) +
  geom_point(aes(y = value, col = antibiotic), size = 0.5, alpha = 0.8) +
  facet_grid(species ~ ind) +
  scale_color_brewer(palette = "Set2") +
  theme(strip.text.y = element_text(angle = 0))

Notice the y axis for every row is the same. That is A) the default and B) a little unhelpful here. There is one point, in the bottom middle panel, that is stretching out the y axis for the rest of the plot.
In this situation - but not all situations - it might make sense to allow the y axis to vary across panels. This carries a strong risk of misinterpretation - people will naturally assume, for good reason, that all axes/visual distances represent the same numeric distances.
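However, the scales argument to the facet
commands gives you the option to “unrestrain” the axes, in both
dimensions ("free") or in just one ("free_x"
or "free_y"). (Thanks to R's partial argument matching, the abbreviated
scale = is also accepted, but the full name is safer.)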
ggplot(antibiotic, aes(x = time)) +
  geom_line(aes(y = svalue), linewidth = 1.2) + # linewidth, not size, sets line thickness in current ggplot2
  geom_point(aes(y = value, col = antibiotic), size = 0.5, alpha = 0.8) +
  facet_grid(species ~ ind, scales = "free_y") + # Notice how this addition allows the Y axis to vary across rows!
  scale_color_brewer(palette = "Set2") +
  theme(strip.text.y = element_text(angle = 0))

In any faceting scenario, the order of the panels is determined by the internal order assigned to the categorical variable. Unless you tell R what that order is, it will assume alphabetical/numerical order.
(One notable exception is lubridate’s
wday() with label = TRUE, which returns an ordered factor with the days
in calendar order: Sunday < Monday < Tuesday < Wednesday … by default.)
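The default ordering, and how to override it, can be seen in a small base R sketch. The day names below are made-up illustrative values, not taken from the antibiotic data.

```r
# Hypothetical values, chosen so alphabetical and calendar order differ.
days <- c("Thursday", "Monday", "Friday", "Monday")

# Without explicit levels, factor() sorts the unique values,
# which for character data means alphabetical order.
default_levels <- levels(factor(days))
default_levels   # "Friday" "Monday" "Thursday"

# Supplying levels fixes the internal order, and with it the panel order
# that the facet commands would use.
explicit <- factor(days, levels = c("Monday", "Thursday", "Friday"))
explicit_levels <- levels(explicit)
explicit_levels  # "Monday" "Thursday" "Friday"
```

The underlying values are unchanged; only the order of the levels, and so of the panels, differs.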
Even in this scenario, where there isn’t an intuitive
Monday/Tuesday/Wednesday-like order, we can still make an effective
decision about the order! We can define an order based on the average
value of the responses over the course of the study, and then change
the factor levels of the species column to reorder the
panels.
# Species names sorted from most to least abundant on average
species_order <- antibiotic %>%
  group_by(species) %>%
  summarise(avg_value = mean(value)) %>%
  arrange(desc(avg_value)) %>%
  pull(species)

# Use that order as the factor levels, which controls the panel order
antibiotic <- antibiotic %>%
  mutate(species = factor(species, levels = species_order))

ggplot(antibiotic, aes(x = time)) +
  geom_line(aes(y = svalue), linewidth = 1.2) +
  geom_point(aes(y = value, col = antibiotic), size = 0.5, alpha = 0.8) +
  facet_grid(species ~ ind, scales = "free_y") +
  scale_color_brewer(palette = "Set2") +
  theme(strip.text.y = element_text(angle = 0))

Now, the top row contains the most abundant species on average, all the way down to the bottom row with the least abundant.
However, there’s a strong argument that this pattern is somewhat
obscured by the fact that we still have the scales set to
"free_y": with a different y axis in each row, differences in abundance
between rows are harder to see. The effects of these visual choices
interact, so each one’s pros and cons can’t be judged in a vacuum!
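As a compact alternative to the summarise/arrange/pull steps above, base R's reorder() can set the factor levels from a summary statistic in one call. This is a sketch on made-up data (the toy species names and values are assumptions for illustration); negating the values gives a descending order, since reorder() sorts levels by increasing score.

```r
# Hypothetical toy data: three species with different average abundances.
toy <- data.frame(
  species = c("a", "a", "b", "b", "c", "c"),
  value = c(1, 3, 10, 20, 5, 7)
)

# Group means are a = 2, b = 15, c = 6. reorder() sorts levels by
# increasing mean of its second argument, so -value puts the largest first.
toy$species <- reorder(toy$species, -toy$value, FUN = mean)

reordered_levels <- levels(toy$species)
reordered_levels  # "b" "c" "a"
```

Passing the reordered column to facet_grid or facet_wrap would then lay out the panels from most to least abundant.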